CHAPTER 1


Why Data Structures Matter 


When people first learn to code, their focus is—and should be—on getting 
their code to run properly. Their code is measured using one simple metric: 
does the code actually work? 


As software engineers gain more experience, though, they begin to learn about 
additional layers and nuances regarding the quality of their code. They learn 
that there can be two snippets of code that both accomplish the same task, 
but that one snippet is better than the other. 


There are numerous measures of code quality. One important measure is 
code maintainability. Maintainability of code involves aspects such as the 
readability, organization, and modularity of one’s code. 


However, there’s another aspect of high-quality code, and that is code
efficiency. For example, you can have two code snippets that both achieve the
same goal, but one runs faster than the other.


Take a look at these two functions, both of which print all the even numbers 
from 2 to 100: 


def print_numbers_version_one():
    number = 2

    while number <= 100:
        # If number is even, print it:
        if number % 2 == 0:
            print(number)

        number += 1


def print_numbers_version_two():
    number = 2

    while number <= 100:
        print(number)

        # Increase number by 2, which, by definition,
        # is the next even number:
        number += 2


Which of these functions do you think runs faster? 


If you said Version 2, you're right. This is because Version 1 ends up looping
99 times (once for each number from 2 through 100), while Version 2 loops only
50 times. The first version, then, takes about twice as many steps as the
second version.
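
One way to check this reasoning is to count the loop passes directly. The
following is only a sketch: the helper names count_steps_version_one and
count_steps_version_two are hypothetical, and they mirror the two functions
above but return the number of passes instead of printing anything. The first
version makes 99 passes (roughly 100), and the second makes 50.

```python
def count_steps_version_one():
    """Mirror of Version 1: visit every number from 2 through 100."""
    steps = 0
    number = 2
    while number <= 100:
        steps += 1      # one pass through the loop body
        number += 1
    return steps


def count_steps_version_two():
    """Mirror of Version 2: visit only the even numbers from 2 through 100."""
    steps = 0
    number = 2
    while number <= 100:
        steps += 1      # one pass through the loop body
        number += 2
    return steps


print(count_steps_version_one())  # 99
print(count_steps_version_two())  # 50
```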


This book is about writing efficient code. Having the ability to write code that 
runs quickly is an important aspect of becoming a better software developer. 


The first step in writing fast code is to understand what data structures are 
and how different data structures can affect the speed of our code. So, let’s 
dive in. 


Data Structures 


Let’s talk about data. 


Data is a broad term that refers to all types of information, down to the most 
basic numbers and strings. In the simple but classic “Hello World!” program, 
the string "Hello World!" is a piece of data. In fact, even the most complex pieces 
of data usually break down into a bunch of numbers and strings. 


Data structures refer to how data is organized. You're going to learn how the
same data can be organized in a variety of ways.


Let’s look at the following code: 


x = "Hello! " 
y = "How are you " 
z = "today?" 


print(x + y + z)


This simple program deals with three pieces of data, outputting three strings
to make one coherent message. If we were to describe how the data is organized
in this program, we’d say that we have three independent strings, each
contained within a single variable.


However, this same data can also be stored in an array:


array = ["Hello! ", "How are you ", "today?"] 


print(array[0] + array[1] + array[2])


You're going to learn in this book that the organization of data doesn’t just 
matter for organization’s sake, but can significantly impact how fast your code 
runs. Depending on how you choose to organize your data, your program may 
run faster or slower by orders of magnitude. And if you're building a program 
that needs to deal with lots of data, or a web app used by thousands of people 
simultaneously, the data structures you select may affect whether your
software runs at all, or simply conks out because it can’t handle the load.


When you have a solid grasp of how data structures affect the performance of
the software you are creating, you will have the keys to writing fast and
elegant code, and your expertise as a software engineer will be greatly
enhanced.


In this chapter, we're going to begin our analysis of two data structures: arrays 
and sets. While the two data structures may seem almost identical, you're 
going to learn the tools to analyze the performance implications of each choice. 


The Array: The Foundational Data Structure 


The array is one of the most basic data structures in computer science. I 
assume you have worked with arrays before, so you are aware that an array 
is a list of data elements. The array is versatile, and can serve as a useful tool 
in many situations, but let’s take a look at one quick example. 


If you are looking at the source code for an application that allows users to 
create and use shopping lists for the grocery store, you might find code 
like this: 


array = ["apples", "bananas", "cucumbers", "dates", "elderberries"]


This array happens to contain five strings, each representing something that 
I might buy at the supermarket. (You’ve got to try elderberries.) 


Arrays come with their own technical jargon. 


The size of an array is how many data elements the array holds. Our grocery 
list array has a size of 5, since it contains five values. 


The index of an array is the number that identifies where a piece of data lives
inside the array.


In most programming languages, we begin counting the index at 0. So, for our
example array, "apples" is at index 0, and "elderberries" is at index 4, like this:


“apples”   “bananas”   “cucumbers”   “dates”    “elderberries”

index 0    index 1     index 2       index 3    index 4
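

As a quick illustration of size and index, here is a minimal sketch that
assumes a Python list stands in for the array. The variable names size,
first_item, and last_item are introduced here purely for illustration.

```python
array = ["apples", "bananas", "cucumbers", "dates", "elderberries"]

# The size is the number of elements the array holds.
size = len(array)

# Indexes start at 0, so the last element lives at index size - 1.
first_item = array[0]
last_item = array[size - 1]

print(size)        # 5
print(first_item)  # apples
print(last_item)   # elderberries
```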


Data Structure Operations 


To understand the performance of any data structure—such as the array—we 
need to analyze the common ways our code might interact with that data 
structure. 


Many data structures are used in four basic ways, which we refer to as 
operations. These operations are: 


• Read: Reading refers to looking something up at a particular spot within
  the data structure. With an array, this means looking up a value at a
  particular index. For example, looking up which grocery item is located
  at index 2 would be reading from the array.


• Search: Searching refers to looking for a particular value within a data
  structure. With an array, this means looking to see if a particular value
  exists within the array, and if so, at which index. For example, looking
  up the index of "dates" in our grocery list would be searching the array.


• Insert: Insertion refers to adding a new value to our data structure. With
  an array, this means adding a new value to an additional slot within the
  array. If we were to add "figs" to our shopping list, we’d be inserting a new
  value into the array.


• Delete: Deletion refers to removing a value from our data structure. With
  an array, this means removing one of the values from the array. For
  example, if we removed "bananas" from our grocery list, this value would
  be deleted from the array.
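

To make the four operations concrete, here is a rough sketch that assumes a
Python list plays the role of the array. The names groceries, item_at_two, and
dates_index are hypothetical, and the built-in list methods used here are just
one way to express each operation.

```python
groceries = ["apples", "bananas", "cucumbers", "dates", "elderberries"]

# Read: look up the value at a particular index.
item_at_two = groceries[2]

# Search: find the index at which a particular value lives.
# (list.index raises ValueError if the value is absent.)
dates_index = groceries.index("dates")

# Insert: add a new value in an additional slot, here at the end.
groceries.append("figs")

# Delete: remove one of the values from the array.
groceries.remove("bananas")

print(item_at_two)  # cucumbers
print(dates_index)  # 3
print(groceries)    # ['apples', 'cucumbers', 'dates', 'elderberries', 'figs']
```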


In this chapter, we'll analyze how fast each of these operations is when
applied to an array.


Measuring Speed 


So, how do we measure the speed of an operation?


If you take away just one thing from this book, let it be this: when we measure
how “fast” an operation is, we do not refer to how fast the operation is in
terms of pure time, but instead to how many steps it takes.


We've actually seen this earlier in the context of printing the even numbers
from 2 to 100. The second version of that function was faster because it took
about half as many steps as the first version did.


Why do we measure code’s speed in terms of steps? 


We do this because we can never say definitively that any operation takes, 
say, five seconds. While a piece of code may take five seconds on a particular 
computer, that same piece of code may take longer on an older piece of 
hardware. For that matter, that same code might run much faster on the 
supercomputers of tomorrow. Measuring the speed of an operation in terms 
of time is undependable, since the time will always change depending on the 
hardware it is run on. 


However, we can measure the speed of an operation in terms of how many 
computational steps it takes. If Operation A takes 5 steps, and Operation B 
takes 500 steps, we can assume that Operation A will always be faster than 
Operation B on all pieces of hardware. Measuring the number of steps is, 
therefore, the key to analyzing the speed of an operation. 


Measuring the speed of an operation is also known as measuring its time 
complexity. Throughout this book, I'll use the terms speed, time complexity, 
efficiency, performance, and runtime interchangeably. They all refer to the 
number of steps a given operation takes. 


Let’s jump into the four operations of an array and determine how many steps 
each one takes. 


Reading 


The first operation we'll look at is reading, which looks up what value is
contained at a particular index inside the array.


A computer can read from an array in just one step. This is because the
computer has the ability to jump to any particular index in the array and
peer inside. In our example of ["apples", "bananas", "cucumbers", "dates", "elderberries"],
if we looked up index 2, the computer would jump right to index 2 and report
that it contains the value "cucumbers".


How is the computer able to look up an array’s index in just one step? Let’s
see how.


A computer’s memory can be viewed as a giant collection of cells. In the
following diagram, you can see a grid of cells in which some are empty and some
contain bits of data:


[Diagram: a grid of memory cells, some empty and some containing data]


While this visual is a simplification of how computer memory works under
the hood, it represents the essential idea.


When a program declares an array, it allocates a contiguous set of empty
cells for use in the program. So, if you were creating an array meant to hold
five elements, your computer would find a group of five empty cells in a row
and designate it to serve as your array:


[Diagram: the grid of memory cells with five adjacent empty cells set aside for the array]


Now, every cell in a computer's memory has a specific address. It’s sort of
like a street address (for example, 123 Main St.), except that it’s represented
with a number. Each cell’s memory address is one number greater than the
previous cell’s address. Here’s a visual that shows each cell’s memory address:


[Diagram: the grid of memory cells, each labeled with its numeric memory address]


In the next diagram, you can see our shopping list array with its indexes and
memory addresses:


                   “apples”   “bananas”   “cucumbers”   “dates”   “elderberries”

index:                0           1            2            3            4

memory address:     1010        1011         1012         1013         1014


When the computer reads a value at a particular index of an array, it can 
jump straight to that index because of the combination of the following facts 
about computers: 


1. A computer can jump to any memory address in one step. For example,
   if you asked a computer to inspect whatever’s at memory address 1063,
   it can access that without having to perform any search process. As an
   analogy, if I ask you to raise your right pinky finger, you wouldn’t have
   to search all your fingers to find which one is your right pinky. You’d be
   able to identify it immediately.

2. Whenever a computer allocates an array, it also makes note of the
   memory address at which the array begins. So, if we asked the computer to
   find the first element of the array, it would be able to instantly jump to
   the appropriate memory address to find it.


Now, these facts explain how the computer can find the first value of an array 
in a single step. However, a computer can also find the value at any index by 
performing simple addition. If we asked the computer to find the value at 
index 3, the computer would simply take the memory address at index 0 and 
add 3. (Memory addresses are sequential, after all.) 


Let’s apply this to our grocery list array. Our example array begins at memory 
address 1010. So, if we told the computer to read the value at index 3, the 
computer would go through the following thought process: 


1. The array begins with index 0, which is at memory address 1010.
2. Index 3 will be exactly three slots past index 0.
3. By logical extension, index 3 would be located at memory address 1013,
   since 1010 + 3 is 1013.


Once the computer knows that index 3 is at memory address 1013, it can 
jump right there and see that it contains the value "dates". 
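

The address arithmetic can be sketched in a few lines of Python. This is only
a toy model under stated assumptions: the dictionary memory, the constant
ARRAY_START_ADDRESS, and the function read_at_index are hypothetical names,
and, as in the text, each value is assumed to occupy exactly one memory cell.

```python
# Toy model of memory: each address maps to the value stored in that cell.
memory = {
    1010: "apples",
    1011: "bananas",
    1012: "cucumbers",
    1013: "dates",
    1014: "elderberries",
}

# The computer notes where the array begins when it allocates it.
ARRAY_START_ADDRESS = 1010


def read_at_index(index):
    """Compute the cell's address with simple addition, then jump to it."""
    address = ARRAY_START_ADDRESS + index
    return memory[address]


print(read_at_index(3))  # dates (address 1010 + 3 = 1013)
```

No cell other than the target is ever inspected, which is why we count reading
as a single step.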


Reading from an array is, therefore, an efficient operation, since the computer 
can read any index by jumping to any memory address in one step. Although 
I described the computer’s thought process by breaking it down into three 
parts, we are currently focusing on the main step of the computer jumping 
to a memory address. (In later chapters, we'll explore how to know which 
steps are the ones worth focusing on.) 


Naturally, an operation that takes just one step is the fastest type of operation. 
Besides being a foundational data structure, arrays are also a very powerful 
data structure because we can read from them with such speed. 


Now, what if instead of asking the computer what value is contained at index 
3, we flipped the question around and asked at what index "dates" can be 
found? That is the search operation, and we'll explore that next. 


Searching 


As I stated previously, searching an array means looking to see whether a 
particular value exists within an array and if so, at which index it’s located. 


In a sense, it’s the inverse of reading. Reading means providing the computer
an index and asking it to return the value contained there. Searching, on the
other hand, means providing the computer a value and asking it to return
the index of that value’s location.


While these two operations sound similar, there’s a world of difference between 
them when it comes to efficiency. Reading from an index is fast, since a 
computer can jump immediately to any index and discover the value contained 
there. Searching, though, is tedious, since the computer has no way to jump 
to a particular value. 


This is an important fact about computers: a computer has immediate access 
to all of its memory addresses, but it has no idea offhand what values are 
contained at each memory address. 


Let’s take our earlier array of fruits and veggies, for example. The computer 
can’t immediately see the actual contents of each cell. To the computer, the 
array looks something like this: 


To search for a fruit within the array, the computer has no choice but to 
inspect each cell one at a time. 


The following diagrams demonstrate the process the computer would use to 
search for "dates" within our array. 


First, the computer checks index 0: 


“apples”      ?      ?      ?      ?

   0          1      2      3      4


Since the value at index 0 is "apples", and not the "dates" we’re looking for, the
computer moves on to the next index:


“apples”   “bananas”      ?      ?      ?

   0           1          2      3      4


Since index 1 doesn’t contain the "dates" we’re looking for either, the computer
moves on to index 2:


“apples”   “bananas”   “cucumbers”      ?      ?

   0           1            2           3      4


Index 2 holds "cucumbers", so we’re out of luck once again, and the computer
moves on to index 3:


“apples”   “bananas”   “cucumbers”   “dates”      ?

   0           1            2           3         4


Aha! We've found the elusive "dates", and now know that the "dates" are found
at index 3. At this point, the computer does not need to move on to the next
cell of the array, since it already found what we're looking for.


In this example, because the computer had to check four different cells until 
it found the value we were searching for, we’d say that this particular operation 
took a total of four steps. 


In Why Algorithms Matter, you'll learn about another way to search an array,
but this basic search operation, in which the computer checks each cell one
at a time, is known as linear search.
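

Here is a minimal sketch of linear search in Python that also counts the cells
it inspects. The function name linear_search and its return shape (an index
paired with a step count) are choices made for this illustration, not a
standard API.

```python
def linear_search(array, target):
    """Check each cell in order; return (index, steps), or (None, steps)."""
    steps = 0
    for index, value in enumerate(array):
        steps += 1              # inspecting one cell counts as one step
        if value == target:
            return index, steps
    return None, steps


groceries = ["apples", "bananas", "cucumbers", "dates", "elderberries"]

print(linear_search(groceries, "dates"))  # (3, 4)
```

The search for "dates" stops at index 3 after inspecting four cells, matching
the walkthrough above.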


Now, what is the maximum number of steps a computer would need to perform 
to conduct a linear search on an array? 


If the value we're seeking happens to be in the final cell in the array (like
"elderberries"), then the computer would end up searching through every cell of
the array until it finally finds the value it’s looking for. Also, if the value
we’re looking for doesn’t occur in the array at all, the computer likewise would
have to search every cell so that it can be sure the value doesn’t exist within
the array.


So, it turns out that for an array of five cells, the maximum number of steps 
linear search would take is five. For an array of 500 cells, the maximum 
number of steps linear search would take is 500. 


Another way of saying this is that for N cells in an array, linear search would 
take a maximum of N steps. In this context, N is just a variable that can be 
replaced by any number. 
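

The worst case can be checked with a small experiment. In this sketch, the
helper count_search_steps is a hypothetical name, and the arrays simply hold
the numbers 0 through N - 1 so that the last value and a missing value are
easy to pick.

```python
def count_search_steps(array, target):
    """Return how many cells a linear search inspects before stopping."""
    steps = 0
    for value in array:
        steps += 1
        if value == target:
            break
    return steps


for n in (5, 500):
    array = list(range(n))                           # values 0 .. n - 1
    steps_for_last = count_search_steps(array, n - 1)   # value in the final cell
    steps_for_missing = count_search_steps(array, -1)   # value not in the array
    print(n, steps_for_last, steps_for_missing)
# Prints:
# 5 5 5
# 500 500 500
```

In both worst-case scenarios the step count equals N, the number of cells.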


In any case, it’s clear that searching is less efficient than reading, since 
searching can take many steps, while reading always takes just one step no 
matter the size of the array. 


Next, we'll analyze the operation of insertion. 